The data files are from two mobility companies. We are interested in learning more about the dynamics of the market in which they operate.

  1. Describe the operations of these two mobility companies.
  2. Estimate the number of scooters Company B operates in this geography. Simpler estimation mechanisms are preferred.

image.png

Important Python Libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

import glob
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
In [2]:
files = glob.glob("*.csv")
print("There are {} data files (training data as well as testing data)".format(len(files)))
files
There are 9 data files (training data as well as testing data)
Out[2]:
['company_A.csv',
 'company_a_1.csv',
 'company_a_2.csv',
 'company_a_3.csv',
 'company_a_4.csv',
 'company_a_5.csv',
 'company_B.csv',
 'company_b_1.csv',
 'company_b_2.csv']

Data Loading

In [3]:
company_a_1 = pd.read_csv("company_a_1.csv")
company_a_1.head()
Out[3]:
TripID ScooterID StartTime EndTime StartLatitude StartLongitude EndLatitude EndLongitude TripDistance
0 0e56a534-66f4-448b-8eaa-beab76509955 5bc901d7-b714-4c57-a614-3d7193da825a 8/9/18 10:24 8/9/18 10:26 38.25 -85.76 38.25 -85.76 0.04
1 d9280fb0-8135-4750-9c71-c610789ba005 3d1905ec-85be-4773-9fe2-34421073363f 8/9/18 10:26 8/9/18 10:56 38.25 -85.74 38.26 -85.77 1.98
2 48fa27f3-3b19-4f0e-8665-ee0b188445fd 5bc901d7-b714-4c57-a614-3d7193da825a 8/9/18 10:26 8/9/18 10:35 38.25 -85.76 38.25 -85.76 0.06
3 6321b580-9055-4418-8c17-1d39a4d49027 4d49088f-4038-4dfb-9d55-e58a157d56d5 8/9/18 10:37 8/9/18 10:56 38.25 -85.74 38.26 -85.77 1.85
4 d05bf0bb-f6ee-489e-a32c-a93e15b626f8 e398ff1b-5d6b-4f15-b493-a988c7ab07b6 8/9/18 10:41 8/9/18 10:47 38.26 -85.75 38.26 -85.76 0.35
In [4]:
company_a_2 = pd.read_csv("company_a_2.csv")
company_a_2.head()
Out[4]:
TripID ScooterID StartTime EndTime StartLatitude StartLongitude EndLatitude EndLongitude TripDistance
0 2eb3f527-411e-4a28-800c-2d836c6a9bb8 ece4a0e2-3fea-4b8c-967e-31359a65bdbe 9/1/18 6:46 9/1/18 6:50 38.24 -85.72 38.24 -85.72 0.02
1 b387ded3-ba33-4152-b4e6-350720211f79 e5b153d7-5bbb-4619-a6eb-69135726d9d9 9/1/18 6:55 9/1/18 6:57 38.26 -85.75 38.25 -85.76 0.44
2 b6ef269f-5bca-4f41-9baf-fd3896194a8d 2c8461af-e2d3-4a92-b481-abf0ff8edb38 9/1/18 7:32 9/1/18 7:44 38.25 -85.76 38.26 -85.76 1.11
3 546b773d-1fba-4504-8330-4fa18256b607 f785de3f-526b-4427-8d9c-e819c3330b74 9/1/18 8:08 9/1/18 8:28 38.24 -85.72 38.24 -85.72 0.00
4 a2cd90e0-7a14-48f8-a29e-2aee76aefd68 c215c995-ec3d-4ab6-8dbd-3213c16f0866 9/1/18 8:14 9/1/18 8:18 38.26 -85.76 38.25 -85.76 0.37
In [5]:
company_a_3 = pd.read_csv("company_a_3.csv")
company_a_3.head()
Out[5]:
TripID ScooterID StartTime EndTime StartLatitude StartLongitude EndLatitude EndLongitude TripDistance
0 8401ec8a-9db4-4159-b1a5-a0838b2dfb6c 9885ab38-bb11-42bd-ad1d-e66336922fb5 9/30/18 4:52 9/30/18 5:14 38.24 -85.72 38.24 -85.72 0.23
1 f6c94abc-13b8-4b35-b18e-a0cae7434590 06e3db9e-b915-417e-b828-e987d9bb758e 9/30/18 6:14 9/30/18 6:16 38.25 -85.74 38.25 -85.74 0.00
2 8b8b5bdb-39c8-4539-ae2d-c1749ad0abad 7ba03898-7956-4f67-bd68-78c9dc99dc85 9/30/18 6:16 9/30/18 6:35 38.25 -85.74 38.23 -85.71 2.80
3 da33667c-a5c4-4d89-98e9-b6d7cdfd778f f5b1f620-de34-457d-8f0a-3ceaf87a9095 9/30/18 7:51 9/30/18 8:01 38.24 -85.72 38.25 -85.73 1.37
4 1c73a22e-412c-43f7-99b3-c5cdea0b96bc 5b7d485b-f1f6-4144-b834-86b16657ec00 9/30/18 8:04 9/30/18 8:06 38.25 -85.70 38.25 -85.70 0.04
In [6]:
company_a_4 = pd.read_csv("company_a_4.csv")
company_a_4.head()
Out[6]:
TripID ScooterID StartTime EndTime StartLatitude StartLongitude EndLatitude EndLongitude TripDistance
0 848ecb85-c9ce-496c-b679-81746bc4888d 84fa754c-6998-4f9b-9838-e0ccca8b0224 11/1/18 7:01 11/1/18 7:19 38.23 -85.71 38.24 -85.75 2.73
1 32b755f4-30db-4aef-8009-6b5a2c0b6140 3efc45f4-c8f6-44e1-98b1-f0eb851ab4c9 11/1/18 7:36 11/1/18 7:40 38.26 -85.76 38.26 -85.77 0.31
2 ae5e8995-a6cd-452d-89d0-7c47d27cfba4 da9afde1-f9a6-4469-8985-63b5d5fc9a7e 11/1/18 7:37 11/1/18 7:51 38.23 -85.75 38.25 -85.76 2.05
3 f576e1d4-a3c0-42f5-adef-0035640c3436 4c4ae433-d1a4-43e5-802b-c96adea54da9 11/1/18 7:46 11/1/18 7:50 38.25 -85.76 38.25 -85.76 0.19
4 d6981fa0-aeec-4252-a514-4ee6e4d1c4af fa7fd7c1-cf93-40c1-a0ec-44de02e4dab4 11/1/18 8:05 11/1/18 8:10 38.26 -85.76 38.26 -85.76 0.43
In [7]:
company_a_5 = pd.read_csv("company_a_5.csv")
company_a_5.head()
Out[7]:
TripID ScooterID StartTime EndTime StartLatitude StartLongitude EndLatitude EndLongitude TripDistance
0 4a1399f9-ea9b-4b23-b7a0-793ccf84ebff cf477adb-fb5f-4205-aac1-c86fe9f00598 12/1/18 13:13 12/1/18 13:14 38.22 -85.76 38.22 -85.76 0.00
1 05d0cd09-d04d-49b4-aa3e-69c3da216f53 167264ff-2aaa-4550-a045-f7278d38fc01 12/1/18 13:13 12/1/18 14:10 38.25 -85.76 38.25 -85.76 1.69
2 5560e705-8361-4dc8-895b-bc8a0a7f1895 a55d536e-709f-46e2-9365-207d9b242a1a 12/1/18 13:14 12/1/18 14:10 38.25 -85.76 38.25 -85.76 1.64
3 e4a0364f-b106-44bf-a405-08b29d29cf0c aaaa8807-9990-4929-98a2-f5d98712df6f 12/1/18 13:15 12/1/18 14:09 38.25 -85.76 38.25 -85.76 1.86
4 e2eea5dd-cc75-4545-8861-8690448c3aa0 62469d0d-84aa-4169-9bd6-a493ca21ab65 12/1/18 13:15 12/1/18 13:24 38.25 -85.76 38.24 -85.76 1.12
In [8]:
company_b_1 = pd.read_csv("company_b_1.csv")
company_b_1.head()
Out[8]:
trip_id start_time completed_time distance_meters start_latitude start_longitude end_latitude end_longitude
0 26559444 2018-12-02T12:41:26.000+00:00 2018-12-02T12:45:40.000+00:00 382 38.23 -85.75 38.23 -85.76
1 26559031 2018-12-02T12:39:24.000+00:00 2018-12-02T12:41:49.000+00:00 109 38.26 -85.75 38.26 -85.76
2 26552748 2018-12-02T12:08:07.000+00:00 2018-12-02T12:18:29.000+00:00 2547 38.21 -85.75 38.23 -85.75
3 26559594 2018-12-02T12:42:16.000+00:00 2018-12-02T12:56:58.000+00:00 1575 38.26 -85.76 38.26 -85.76
4 26557132 2018-12-02T12:30:01.000+00:00 2018-12-02T12:48:30.000+00:00 37 38.22 -85.75 38.23 -85.76
In [9]:
company_b_2 = pd.read_csv("company_b_2.csv")
company_b_2.head()
Out[9]:
trip_id start_time completed_time distance_meters start_latitude start_longitude end_latitude end_longitude
0 32246858 2019-01-03T00:26:04.000+00:00 2019-01-03T00:31:41.000+00:00 49 38.24 -85.76 38.24 -85.76
1 32246751 2019-01-03T00:24:00.000+00:00 2019-01-03T00:31:18.000+00:00 154 38.24 -85.76 38.24 -85.76
2 33040870 2019-01-07T13:33:19.000+00:00 2019-01-07T14:19:24.000+00:00 1695 38.21 -85.77 38.21 -85.77
3 33419326 2019-01-09T17:32:27.000+00:00 2019-01-09T17:38:20.000+00:00 690 38.25 -85.76 38.25 -85.75
4 34672477 2019-01-16T15:31:11.000+00:00 2019-01-16T15:32:22.000+00:00 0 38.26 -85.76 38.26 -85.76

Distances in the Company B (test) data are in meters, while the Company A (training) data uses kilometers, so the distances must be put on a common scale before comparison.
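Putting the two on the same scale is a one-line conversion; a minimal sketch with toy frames using the column names from the files above:

```python
import pandas as pd

# Toy data standing in for the two companies' distance columns.
company_a = pd.DataFrame({"TripDistance": [0.04, 1.98]})    # kilometers
company_b = pd.DataFrame({"distance_meters": [382, 2547]})  # meters

# Convert Company B's meters to kilometers for a like-for-like comparison.
company_b["distance_km"] = company_b["distance_meters"] / 1000
```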

In [10]:
import geopy.distance
coords_1 = (38.23, -85.75)
coords_2 = (38.25, -85.76)
print(geopy.distance.geodesic(coords_1, coords_2).km)
2.3864013229747503

Combine Company A Data into One DataFrame

In [11]:
training_data = pd.concat([company_a_1, company_a_2, company_a_3, company_a_4, company_a_5])
training_data.shape
Out[11]:
(57365, 9)

Combine Company B Data into One DataFrame

In [12]:
testing_data = pd.concat([company_b_1, company_b_2])
testing_data.shape
Out[12]:
(10578, 8)

Convert start_time and completed_time to Date Type

In [13]:
training_data["StartTime"] = pd.to_datetime(training_data["StartTime"])
training_data["EndTime"] = pd.to_datetime(training_data["EndTime"])
testing_data["start_time"] = pd.to_datetime(testing_data["start_time"])
testing_data["completed_time"] = pd.to_datetime(testing_data["completed_time"])

Extract Year, Month, Day, and Hour from the Start Time Columns

Also calculate the trip duration:

$$tripDuration = time_{end} - time_{start}$$
In [14]:
training_data["year"] = training_data["StartTime"].dt.year
testing_data["year"] = testing_data["start_time"].dt.year
training_data["month"] = training_data["StartTime"].dt.month_name()
testing_data["month"] = testing_data["start_time"].dt.month_name()
training_data["day"] = training_data["StartTime"].dt.day_name()
testing_data["day"] = testing_data["start_time"].dt.day_name()
training_data["hour"] = training_data["StartTime"].dt.hour
testing_data["hour"] = testing_data["start_time"].dt.hour

training_data["tripDuration(minutes)"] = round((training_data["EndTime"] - training_data["StartTime"]).dt.total_seconds()/60)
testing_data["tripDuration(minutes)"] = round((testing_data["completed_time"] - testing_data["start_time"]).dt.total_seconds()/60)

training_data.head()
Out[14]:
TripID ScooterID StartTime EndTime StartLatitude StartLongitude EndLatitude EndLongitude TripDistance year month day hour tripDuration(minutes)
0 0e56a534-66f4-448b-8eaa-beab76509955 5bc901d7-b714-4c57-a614-3d7193da825a 2018-08-09 10:24:00 2018-08-09 10:26:00 38.25 -85.76 38.25 -85.76 0.04 2018 August Thursday 10 2.0
1 d9280fb0-8135-4750-9c71-c610789ba005 3d1905ec-85be-4773-9fe2-34421073363f 2018-08-09 10:26:00 2018-08-09 10:56:00 38.25 -85.74 38.26 -85.77 1.98 2018 August Thursday 10 30.0
2 48fa27f3-3b19-4f0e-8665-ee0b188445fd 5bc901d7-b714-4c57-a614-3d7193da825a 2018-08-09 10:26:00 2018-08-09 10:35:00 38.25 -85.76 38.25 -85.76 0.06 2018 August Thursday 10 9.0
3 6321b580-9055-4418-8c17-1d39a4d49027 4d49088f-4038-4dfb-9d55-e58a157d56d5 2018-08-09 10:37:00 2018-08-09 10:56:00 38.25 -85.74 38.26 -85.77 1.85 2018 August Thursday 10 19.0
4 d05bf0bb-f6ee-489e-a32c-a93e15b626f8 e398ff1b-5d6b-4f15-b493-a988c7ab07b6 2018-08-09 10:41:00 2018-08-09 10:47:00 38.26 -85.75 38.26 -85.76 0.35 2018 August Thursday 10 6.0

Unique Values

In [15]:
training_data.nunique()
Out[15]:
TripID                   57365
ScooterID                  858
StartTime                40304
EndTime                  38922
StartLatitude               20
StartLongitude              35
EndLatitude                 31
EndLongitude                41
TripDistance              1169
year                         1
month                        5
day                          7
hour                        18
tripDuration(minutes)      217
dtype: int64
In [16]:
testing_data.nunique()
Out[16]:
trip_id                  10578
start_time               10537
completed_time           10513
distance_meters           3640
start_latitude              14
start_longitude             20
end_latitude                25
end_longitude               31
year                         2
month                        2
day                          7
hour                        24
tripDuration(minutes)      127
dtype: int64

Missing Values Analysis

In [17]:
training_data.isna().sum()
Out[17]:
TripID                   0
ScooterID                0
StartTime                0
EndTime                  0
StartLatitude            0
StartLongitude           0
EndLatitude              1
EndLongitude             1
TripDistance             0
year                     0
month                    0
day                      0
hour                     0
tripDuration(minutes)    0
dtype: int64
In [18]:
testing_data.isna().sum()
Out[18]:
trip_id                  0
start_time               0
completed_time           0
distance_meters          0
start_latitude           0
start_longitude          0
end_latitude             0
end_longitude            0
year                     0
month                    0
day                      0
hour                     0
tripDuration(minutes)    0
dtype: int64
In [19]:
training_data.columns
Out[19]:
Index(['TripID', 'ScooterID', 'StartTime', 'EndTime', 'StartLatitude',
       'StartLongitude', 'EndLatitude', 'EndLongitude', 'TripDistance', 'year',
       'month', 'day', 'hour', 'tripDuration(minutes)'],
      dtype='object')

Check for Anomalies / Outliers

  • We are looking for issues in the datasets
  • There might be invalid values, such as:
    • trip distances that are not greater than zero
    • trip durations that are not greater than zero
    • implausibly large values for duration and distance
  • There might be invalid locations, such as latitudes/longitudes belonging to another state or country
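The threshold logic used below is the 3-sigma rule; a sketch on a hypothetical series (the data here is made up, not from either company):

```python
import numpy as np
import pandas as pd

# Hypothetical distances: twenty ordinary trips and one implausible value.
s = pd.Series([1.0] * 20 + [1000.0])

mean, std = s.mean(), np.std(s)  # np.std defaults to population std (ddof=0)
high = mean + 3 * std            # highest allowed value
low = mean - 3 * std             # lowest allowed value

outliers = s[(s > high) | (s < low)]
print(outliers.tolist())  # -> [1000.0]
```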
In [20]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import numpy as np

def plotOutliers(data, c):
    """Plot a column and flag values outside the mean +/- 3 standard deviations."""
    print("Feature Name:", c)
    figure(figsize=(18, 8), dpi=80)
    plt.plot(range(len(data[c])), data[c].values)

    values = np.array(data[c])
    H = np.mean(values) + 3 * np.std(values)  # highest allowed value
    L = np.mean(values) - 3 * np.std(values)  # lowest allowed value
    print("Highest allowed", H)
    print("Lowest allowed", L)

    outliers = values[(values > H) | (values < L)]
    if len(outliers) < 10:
        print(outliers)

    # Plot the outliers as markers; non-outliers are drawn at the column mean.
    R = [v if (v > H or v < L) else data[c].mean() for v in data[c].values]
    plt.plot(range(len(data[c])), R, linestyle='None', marker="*", markersize=15, label="Outliers")
    plt.title(c)
    plt.legend()
    plt.show()
In [21]:
c = 'StartLatitude'
plotOutliers(training_data, c)
Feature Name: StartLatitude
Highest allowed 38.29183977563854
Lowest allowed 38.195000980920334
In [22]:
c = 'StartLongitude'
plotOutliers(training_data, c)
Feature Name: StartLongitude
Highest allowed -85.68430309136178
Lowest allowed -85.80672976839591
In [23]:
c = 'EndLatitude'
plotOutliers(training_data, c)
Feature Name: EndLatitude
Highest allowed nan
Lowest allowed nan
[]

There are some invalid locations, i.e., locations outside the city.

image.png

In [24]:
invalid_locations = [-122.42, -111.87, -111.87, -111.87, 26.12]  # coordinate values far outside Louisville
training_data = training_data[~training_data["EndLatitude"].isin(invalid_locations)]
training_data = training_data[training_data["EndLatitude"] != 0]
In [25]:
c = 'EndLongitude'
plotOutliers(training_data, c)
Feature Name: EndLongitude
Highest allowed nan
Lowest allowed nan
[]
In [26]:
training_data = training_data[~training_data["EndLongitude"].isna()]
training_data = training_data[~training_data["EndLongitude"].isin(invalid_locations)]
training_data = training_data[training_data["EndLongitude"] != 0]
In [27]:
c = 'TripDistance'
plotOutliers(training_data, c)
Feature Name: TripDistance
Highest allowed 40716.177442501794
Lowest allowed -40417.89196602075
[1111734.31 1334335.28 1334377.66 1213527.47 1334328.7   822605.85
 1327569.19]
In [28]:
invalid_values = [1111734.31, 1334335.28, 1334377.66, 1213527.47, 1334328.7, 822605.85, 1327569.19]
training_data = training_data[~training_data["TripDistance"].isin(invalid_values)]
training_data = training_data[training_data["TripDistance"] > 0]
In [29]:
c = 'tripDuration(minutes)'
plotOutliers(training_data, c)
Feature Name: tripDuration(minutes)
Highest allowed 109.24739802159588
Lowest allowed -74.49156372161359
In [30]:
invalid_values = [3167, 2843]
training_data = training_data[~training_data["tripDuration(minutes)"].isin(invalid_values)]
training_data = training_data[training_data["tripDuration(minutes)"] > 0]
In [31]:
c = 'TripDistance'
plotOutliers(training_data, c)
Feature Name: TripDistance
Highest allowed 142.34470113316908
Lowest allowed -138.76840767713637
[ 1968.14 10374.71]
In [32]:
training_data = training_data[(training_data["TripDistance"] < 2000) & (training_data["TripDistance"] > 0)]
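The repeated filtering steps above can be collected into a small helper; a sketch (`drop_invalid` is my own name, not from the notebook):

```python
import pandas as pd

def drop_invalid(df, col, invalid_values=(), lower=None, upper=None):
    """Drop rows whose `col` is a known-bad value or outside (lower, upper)."""
    out = df[~df[col].isin(list(invalid_values))]
    if lower is not None:
        out = out[out[col] > lower]
    if upper is not None:
        out = out[out[col] < upper]
    return out

# Toy frame mirroring the TripDistance cleanup above.
trips = pd.DataFrame({"TripDistance": [0.5, -1.0, 2500.0, 10.0, 1111734.31]})
trips = drop_invalid(trips, "TripDistance",
                     invalid_values=[1111734.31], lower=0, upper=2000)
```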

Pre-process Company B Data as well

In [33]:
c = 'start_latitude'
plotOutliers(testing_data, c)
Feature Name: start_latitude
Highest allowed 38.28828529502235
Lowest allowed 38.17840216952672
In [34]:
c = 'start_longitude'
plotOutliers(testing_data, c)
Feature Name: start_longitude
Highest allowed -85.71075142272339
Lowest allowed -85.79717067975345
In [35]:
c = 'end_latitude'
plotOutliers(testing_data, c)
Feature Name: end_latitude
Highest allowed 39.0923053831733
Lowest allowed 37.3732306349776
[33.42 48.84 26.12 48.86 48.86 37.34 39.53 19.42]

image.png

The regions circled in red are outliers: cross-state or very distant rides are not realistic for a shared scooter.

In [36]:
invalid_values = [26.12, 19.42]
testing_data = testing_data[~testing_data["end_latitude"].isin(invalid_values)]
testing_data = testing_data[testing_data["end_latitude"] != 0]
In [37]:
c = 'end_longitude'
plotOutliers(testing_data, c)
Feature Name: end_longitude
Highest allowed -80.99518715153191
Lowest allowed -90.48042933863452
[-111.94    2.36    2.32    2.33 -121.89 -119.81]
In [38]:
invalid_values = [-111.94, 2.36, 2.32, 2.33, -121.89, -119.81]
testing_data = testing_data[~testing_data["end_longitude"].isin(invalid_values)]
testing_data = testing_data[testing_data["end_longitude"] != 0]
In [39]:
c = 'tripDuration(minutes)'
plotOutliers(testing_data, c)
Feature Name: tripDuration(minutes)
Highest allowed 72.35456565096352
Lowest allowed -45.76781068407611
In [40]:
testing_data = testing_data[(testing_data["tripDuration(minutes)"] < 400) & (testing_data["tripDuration(minutes)"] > 0)]
In [41]:
c = 'distance_meters'
plotOutliers(testing_data, c)
Feature Name: distance_meters
Highest allowed 8116.772626352364
Lowest allowed -5125.6959666605235
In [42]:
testing_data = testing_data[(testing_data["distance_meters"] < 50000) & (testing_data["distance_meters"] > 0)]
In [43]:
training_data.to_csv("company_A.csv", index=False)
testing_data.to_csv("company_B.csv", index=False)

Analyzing Source and Destination Points (Company A)

image.png

Note: Figure 1 generated using Power BI. Left: starting points of journeys; right: ending points of journeys.

Analyzing Source & Destination Points (Company B)

image.png

Note: Figure 2 generated using Power BI. Left: starting points of journeys; right: ending points of journeys.

Company B appears to operate in the same region as Company A, i.e., its scooters are deployed in the same city. The main difference is that Company A covers the whole city, whereas Company B concentrates on a specific area (though it is not strictly limited to it).

Facts About This Location

Note: This external data can be used for a deeper, more detailed analysis.

city = Louisville

county = Jefferson County

state = Kentucky

postcode = 40202

country = United States

https://censusreporter.org/profiles/05000US21111-jefferson-county-ky/

image.png

The population of this area is large enough to make it an attractive market for a scooter company.

  • The county has roughly 766,000 residents (about 2,000 people per square mile), which means:
    • In the best case, if everyone used a scooter, you would need 766,000 scooters.
    • If 50% of the population used scooters, you would need half of that.

image.png

  • About 61% of the population (473,000 people) is aged 18 to 64. This group is the most likely to ride; the rest are children or elderly people, who seldom use scooters. This reduces the upper bound from 766,000 to 473,000.
  • Roughly 67% of residents are native to the area and are more likely to own a vehicle and skip scooters. If we assume only half of the 18-64 group would actually ride, the estimate drops from 473,000 to about 236,500.
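The funnel above amounts to a few multiplications; a sketch using the figures quoted from the census profile (the 50% adoption share is an assumption, not a sourced number):

```python
population = 766_000   # Jefferson County residents (census profile above)
working_age = 473_000  # ~61% of residents aged 18-64, as quoted above

# Assume only half of the working-age group would actually ride.
half_adoption = round(working_age * 0.5)
print(half_adoption)  # -> 236500
```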

image.png

Factoring in the poverty rate and the share of people who already commute by public transport, walking, or other modes reduces the number of scooters needed still further.

image.png


Compare both Datasets using Visual Analysis

In [44]:
def showHist(columnName1, columnName2):
    fig, a = plt.subplots(1, 2, figsize=(20, 5))
    try:
        # Numeric columns: side-by-side histograms.
        a[0].hist(training_data[columnName1])
        a[1].hist(testing_data[columnName2])
    except (TypeError, ValueError):
        # Categorical columns (e.g. day, month): fall back to bar charts.
        training_data[columnName1].value_counts().head(20).sort_index().plot(kind='barh', ax=a[0])
        testing_data[columnName2].value_counts().head(20).sort_index().plot(kind='barh', ax=a[1])

    a[0].set_title(columnName1 + ' - Company A')
    a[1].set_title(columnName2 + ' - Company B')
    fig.tight_layout()
    plt.subplots_adjust(left=0.125, bottom=0.1, right=0.9, top=0.9,
                        wspace=0.4, hspace=0.35)
    plt.show()
In [45]:
showHist("StartLatitude", "start_latitude")
In [46]:
showHist("StartLongitude", "start_longitude")
In [47]:
showHist("EndLatitude", "end_latitude")
In [48]:
showHist("EndLongitude", "end_longitude")
In [49]:
training_data["distance_meters"] = training_data["TripDistance"] * 1000

testing_data["distance_meters"].min()
Out[49]:
1
In [50]:
showHist("distance_meters", "distance_meters")
In [51]:
showHist("year", "year")
In [52]:
showHist("day", "day")
In [53]:
showHist("hour", "hour")
In [54]:
showHist("month", "month")
In [55]:
showHist("tripDuration(minutes)", "tripDuration(minutes)")

Observations

From the histograms above, I conclude the following:

  • Company A's data is from 2018 only, while Company B's spans 2018 and 2019.
    • Data being compared should cover the same period, so keep only the 2018 records.
  • Company A's data covers August through December, while Company B's covers only December and January.
    • For a fair comparison, take the overlapping month, i.e., December.
In [56]:
print("Look at the Year Wise Data - Company A")
print(training_data["year"].value_counts())

print("\n\nLook at the Year Wise Data - Company B")
print(testing_data["year"].value_counts())
Look at the Year Wise Data - Company A
2018    50855
Name: year, dtype: int64


Look at the Year Wise Data - Company B
2019    6183
2018    3964
Name: year, dtype: int64
In [57]:
print('For analysis we need data from the same year as Company A')
testing_data = testing_data[testing_data["year"] == 2018]
For analysis we need data from the same year as Company A
In [58]:
print("Take December Data from both Datasets")
training_data = training_data[training_data["month"] == "December"]
testing_data = testing_data[testing_data["month"] == "December"]
Take December Data from both Datasets
In [59]:
print('There are {} records in Company A'.format(training_data.shape[0]))
print('There are {} records in Company B'.format(testing_data.shape[0]))
There are 7819 records in Company A
There are 3964 records in Company B
In [60]:
training_data.nunique()
Out[60]:
TripID                   7819
ScooterID                 378
StartTime                6111
EndTime                  5835
StartLatitude              15
StartLongitude             23
EndLatitude                19
EndLongitude               22
TripDistance              619
year                        1
month                       1
day                         7
hour                       18
tripDuration(minutes)     128
distance_meters           619
dtype: int64
In [61]:
scoters_trips = training_data[["ScooterID", "TripID"]].groupby(["ScooterID"]).count().reset_index()
scoters_trips = scoters_trips.sort_values(by="TripID", ascending=False)
scoters_trips
Out[61]:
ScooterID TripID
189 7ce2fe71-f7f3-41d7-8ad2-f2c9f687342e 55
217 8f75885e-ca1c-4521-ad18-095cec7dae85 55
39 167264ff-2aaa-4550-a045-f7278d38fc01 54
348 ec1618c8-6f9b-4bda-9694-c17977762699 50
129 4f02884d-bd54-4715-a81b-3a6a5fe3b080 50
... ... ...
16 0b67c9de-6ecd-42d3-8dbe-4fa9eddcef97 1
33 14204783-65f5-42c2-a546-78d6bcaf2b4a 1
233 9a7bbfb0-06b3-43a5-b668-9f937c71fc7d 1
267 b028789b-fbf7-46f6-b5c8-f057c075f107 1
261 ac4967d2-32bd-47b2-bcb7-5f1d067e8451 1

378 rows × 2 columns

In [62]:
print("There are {} scooters in total for company A".format(scoters_trips["ScooterID"].nunique()))

average_trips_per_scooter = round(training_data["TripID"].nunique() / scoters_trips["ScooterID"].nunique())

print("On average, each scooter makes {} trips".format(average_trips_per_scooter))
There are 378 scooters in total for company A
On average, each scooter makes 21 trips
In [63]:
month_trips_a = training_data[["month", "TripID", "ScooterID"]].groupby(["month"]).nunique().reset_index()
month_trips_a = month_trips_a.sort_values(by="TripID", ascending=False)
print("Monthly Trips")
print("On average company A has {} trips".format(round(month_trips_a["TripID"].mean())))
print("On average company A has {} active scooters".format(round(month_trips_a["ScooterID"].mean())))
Monthly Trips
On average company A has 7819 trips
On average company A has 378 active scooters
In [64]:
day_trips_a = training_data[["month","day","TripID", "ScooterID"]].groupby(["month","day"]).nunique().reset_index()
day_trips_a = day_trips_a.sort_values(by="TripID", ascending=False)
print("Daily Trips")
print("On average company A has {} trips".format(round(day_trips_a["TripID"].mean())))
print("On average company A has {} active scooters".format(round(day_trips_a["ScooterID"].mean())))
Daily Trips
On average company A has 1117 trips
On average company A has 288 active scooters
In [65]:
hourly_trips_a = training_data[["month","day", "hour","TripID", "ScooterID"]].groupby(["month","day","hour"]).nunique().reset_index()
hourly_trips_a = hourly_trips_a.sort_values(by="TripID", ascending=False)
print("Hourly Trips")
print("On average company A has {} trips".format(round(hourly_trips_a["TripID"].mean())))
print("On average company A has {} active scooters".format(round(hourly_trips_a["ScooterID"].mean())))
Hourly Trips
On average company A has 65 trips
On average company A has 52 active scooters
In [66]:
month_trips_b = testing_data[["month", "trip_id"]].groupby(["month"]).nunique().reset_index()
month_trips_b = month_trips_b.sort_values(by="trip_id", ascending=False)
print("Monthly Trips")
print("On average company B has {} trips".format(round(month_trips_b["trip_id"].mean())))
Monthly Trips
On average company B has 3964 trips
In [67]:
day_trips_b = testing_data[["month","day", "trip_id"]].groupby(["month","day"]).nunique().reset_index()
print("Daily Trips")
print("On average company B has {} trips".format(round(day_trips_b["trip_id"].mean())))
Daily Trips
On average company B has 566 trips
In [68]:
hourly_trips_b = testing_data[["month","day", "hour","trip_id"]].groupby(["month","day","hour"]).nunique().reset_index()
hourly_trips_b = hourly_trips_b.sort_values(by="trip_id", ascending=False)
print("Hourly Trips")
print("On average company B has {} trips".format(round(hourly_trips_b["trip_id"].mean())))
Hourly Trips
On average company B has 24 trips

Comparative Analysis

I calculate a set of features for Company A and the same features for Company B, then take the ratio of each Company B feature to its Company A counterpart:

$$ratio_{feature_i} = \frac {Feature_{companyB,i}} {Feature_{companyA,i}}$$

The ratios are grouped (averages, maxima, minima, and sums) and each group is averaged:

$$value_{group} = \frac {1} {N} \sum_{i=1}^{N} ratio_i$$

where N is the number of ratios in the group.

The final comparative constant is a weighted average of the group values:

$$constant = \sum_{g=1}^{4} {value_g \cdot weight_g}$$

I use the following weights:

  • 0.3 for the average values (avg_of_avg)
  • 0.2 for the maximum values (avg_of_maximum)
  • 0.2 for the minimum values (avg_of_minimum)
  • 0.3 for the sum values (avg_of_sum)

Note: these weights were tuned by trial and error to find a reasonable value.

In [69]:
print("Company A - Calculations")
print("-------------------------")
a_trips = training_data["TripID"].nunique()

a_trip_distance_minimum = training_data["distance_meters"].min()
a_trip_distance_maximum = training_data["distance_meters"].max()
a_trip_distance_average = training_data["distance_meters"].mean()
a_trip_distance_sum = training_data["distance_meters"].sum()

a_trip_duration_minimum = training_data["tripDuration(minutes)"].min()
a_trip_duration_maximum = training_data["tripDuration(minutes)"].max()
a_trip_duration_average = training_data["tripDuration(minutes)"].mean()
a_trip_duration_sum = training_data["tripDuration(minutes)"].sum()

a_average_monthly_trip = round(month_trips_a["TripID"].mean())
a_minimum_monthly_trip = round(month_trips_a["TripID"].min())
a_maximum_monthly_trip = round(month_trips_a["TripID"].max())

a_average_daily_trip = round(day_trips_a["TripID"].mean())
a_minimum_daily_trip = round(day_trips_a["TripID"].min())
a_maximum_daily_trip = round(day_trips_a["TripID"].max())

a_average_hourly_trip = round(hourly_trips_a["TripID"].mean())
a_minimum_hourly_trip = round(hourly_trips_a["TripID"].min())
a_maximum_hourly_trip = round(hourly_trips_a["TripID"].max())
Company A - Calculations
-------------------------
In [70]:
print("Company B - Calculations")
print("-------------------------")
b_trips = testing_data["trip_id"].nunique()

b_trip_distance_minimum = testing_data["distance_meters"].min()
b_trip_distance_maximum = testing_data["distance_meters"].max()
b_trip_distance_average = testing_data["distance_meters"].mean()
b_trip_distance_sum = testing_data["distance_meters"].sum()

b_trip_duration_minimum = testing_data["tripDuration(minutes)"].min()
b_trip_duration_maximum = testing_data["tripDuration(minutes)"].max()
b_trip_duration_average = testing_data["tripDuration(minutes)"].mean()
b_trip_duration_sum = testing_data["tripDuration(minutes)"].sum()

b_average_monthly_trip = round(month_trips_b["trip_id"].mean())
b_minimum_monthly_trip = round(month_trips_b["trip_id"].min())
b_maximum_monthly_trip = round(month_trips_b["trip_id"].max())

b_average_daily_trip = round(day_trips_b["trip_id"].mean())
b_minimum_daily_trip = round(day_trips_b["trip_id"].min())
b_maximum_daily_trip = round(day_trips_b["trip_id"].max())

b_average_hourly_trip = round(hourly_trips_b["trip_id"].mean())
b_minimum_hourly_trip = round(hourly_trips_b["trip_id"].min())
b_maximum_hourly_trip = round(hourly_trips_b["trip_id"].max())
Company B - Calculations
-------------------------
In [71]:
b_a_trip_ratio = b_trips / a_trips

b_a_distance_min_ratio = b_trip_distance_minimum / a_trip_distance_minimum
b_a_distance_max_ratio = b_trip_distance_maximum / a_trip_distance_maximum
b_a_distance_avg_ratio = b_trip_distance_average / a_trip_distance_average
b_a_distance_sum_ratio = b_trip_distance_sum / a_trip_distance_sum

b_a_duration_min_ratio = b_trip_duration_minimum / a_trip_duration_minimum
b_a_duration_max_ratio = b_trip_duration_maximum / a_trip_duration_maximum
b_a_duration_avg_ratio = b_trip_duration_average / a_trip_duration_average
b_a_duration_sum_ratio = b_trip_duration_sum / a_trip_duration_sum

b_a_average_monthly_trip = b_average_monthly_trip / a_average_monthly_trip
b_a_minimum_monthly_trip = b_minimum_monthly_trip / a_minimum_monthly_trip
b_a_maximum_monthly_trip = b_maximum_monthly_trip / a_maximum_monthly_trip

b_a_average_daily_trip = b_average_daily_trip / a_average_daily_trip
b_a_minimum_daily_trip = b_minimum_daily_trip / a_minimum_daily_trip
b_a_maximum_daily_trip = b_maximum_daily_trip / a_maximum_daily_trip

b_a_average_hourly_trip = b_average_hourly_trip / a_average_hourly_trip
b_a_minimum_hourly_trip = b_minimum_hourly_trip / a_minimum_hourly_trip
b_a_maximum_hourly_trip = b_maximum_hourly_trip / a_maximum_hourly_trip

print("Calculate Ratios")
print("-----------------------------------------------")
print("Number of Trips = ", b_a_trip_ratio)
print("-----------------------------------------------")
print("Minimum Distance Ratio = ", b_a_distance_min_ratio)
print("Maximum Distance Ratio = ", b_a_distance_max_ratio)
print("Average Distance Ratio = ", b_a_distance_avg_ratio)
print("Sum Distance Ratio = ", b_a_distance_sum_ratio)
print("-----------------------------------------------")

print("Minimum Duration Ratio = ", b_a_duration_min_ratio)
print("Maximum Duration Ratio = ", b_a_duration_max_ratio)
print("Average Duration Ratio = ", b_a_duration_avg_ratio)
print("Sum Duration Ratio = ", b_a_duration_sum_ratio)
print("-----------------------------------------------")
print("Minimum Monthly Trips Ratio = ", b_a_minimum_monthly_trip)
print("Maximum Monthly Trips Ratio = ", b_a_maximum_monthly_trip)
print("Average Monthly Trips Ratio = ", b_a_average_monthly_trip)
print("-----------------------------------------------")
print("Minimum Daily Trips Ratio = ", b_a_minimum_daily_trip)
print("Maximum Daily Trips Ratio = ", b_a_maximum_daily_trip)
print("Average Daily Trips Ratio = ", b_a_average_daily_trip)
print("-----------------------------------------------")
print("Minimum Hourly Trips Ratio = ", b_a_minimum_hourly_trip)
print("Maximum Hourly Trips Ratio = ", b_a_maximum_hourly_trip)
print("Average Hourly Trips Ratio = ", b_a_average_hourly_trip)
Calculate Ratios
-----------------------------------------------
Number of Trips =  0.5069702007929403
-----------------------------------------------
Minimum Distance Ratio =  0.1
Maximum Distance Ratio =  0.2745020576131687
Average Distance Ratio =  1.6216675365277475
Sum Distance Ratio =  0.8221371166128649
-----------------------------------------------
Minimum Duration Ratio =  1.0
Maximum Duration Ratio =  0.5695142378559463
Average Duration Ratio =  1.2345980404122936
Sum Duration Ratio =  0.625904416446391
-----------------------------------------------
Minimum Monthly Trips Ratio =  0.5069702007929403
Maximum Monthly Trips Ratio =  0.5069702007929403
Average Monthly Trips Ratio =  0.5069702007929403
-----------------------------------------------
Minimum Daily Trips Ratio =  0.473224043715847
Maximum Daily Trips Ratio =  0.6415094339622641
Average Daily Trips Ratio =  0.5067144136078783
-----------------------------------------------
Minimum Hourly Trips Ratio =  1.0
Maximum Hourly Trips Ratio =  0.553763440860215
Average Hourly Trips Ratio =  0.36923076923076925
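The monthly/daily/hourly trip ratios above come from grouped trip counts. A toy sketch of the daily case, using made-up timestamps rather than the real files, shows the pattern (note that because each figure is a ratio of B's statistic to A's, the "minimum" ratio need not be smaller than the "maximum" ratio):

```python
import pandas as pd

# Toy example of the daily trip-count ratios; timestamps are made up,
# not taken from the company data files.
a = pd.DataFrame({"StartTime": pd.to_datetime(
        ["2018-08-09 10:24", "2018-08-09 11:00", "2018-08-10 09:30"])})
b = pd.DataFrame({"StartTime": pd.to_datetime(
        ["2018-08-09 12:00", "2018-08-10 14:15"])})

a_daily = a.groupby(a["StartTime"].dt.date).size()  # trips per day, company A
b_daily = b.groupby(b["StartTime"].dt.date).size()  # trips per day, company B

ratio_avg = b_daily.mean() / a_daily.mean()  # B/A ratio of average daily trips
ratio_min = b_daily.min() / a_daily.min()    # B/A ratio of quietest days
ratio_max = b_daily.max() / a_daily.max()    # B/A ratio of busiest days
```

With these toy inputs the minimum-ratio (1.0) exceeds the maximum-ratio (0.5), which is why such inversions in the printed output are not errors.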

Use B-to-A Ratios to Estimate the Number of Scooters Company B Operates

In [72]:
avg_of_minimum = (b_a_distance_min_ratio + b_a_duration_min_ratio + b_a_minimum_monthly_trip + b_a_minimum_daily_trip + b_a_minimum_hourly_trip)/5

avg_of_maximum = (b_a_distance_max_ratio + b_a_duration_max_ratio + b_a_maximum_monthly_trip + b_a_maximum_daily_trip + b_a_maximum_hourly_trip)/5

avg_of_sum = (b_a_distance_sum_ratio + b_a_duration_sum_ratio) /2

avg_of_avg = (b_a_distance_avg_ratio + b_a_duration_avg_ratio + b_a_average_monthly_trip + b_a_average_daily_trip + b_a_average_hourly_trip)/5

avg_of_avg, avg_of_maximum, avg_of_minimum, avg_of_sum
Out[72]:
(0.8478361921143257, 0.5092518742169069, 0.6160388489017575, 0.724020766529628)
In [78]:
overall_average = (avg_of_avg*0.3 + avg_of_maximum*0.2 + avg_of_minimum*0.2 + avg_of_sum*0.3)
overall_average
Out[78]:
0.696615232216919
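The 0.3/0.2/0.2/0.3 weights in the blend above are a judgment call. A quick sketch comparing the chosen blend with a plain equal-weight average (values copied from Out[72]) shows how sensitive the final ratio is to that choice:

```python
# Group averages copied from Out[72] of this notebook.
avgs = {"avg": 0.8478361921143257, "max": 0.5092518742169069,
        "min": 0.6160388489017575, "sum": 0.724020766529628}

# Blend used above: heavier weight on the average- and sum-based ratios.
weighted = (avgs["avg"] * 0.3 + avgs["max"] * 0.2
            + avgs["min"] * 0.2 + avgs["sum"] * 0.3)
# Naive alternative: weight all four groups equally.
equal = sum(avgs.values()) / 4

print(weighted, equal)
```

The two blends differ by only about 0.02, so the fleet estimate is not very sensitive to this particular weighting.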
In [79]:
print("Estimated Number of Trips of Company B = ", round(training_data["TripID"].nunique()*overall_average))
print("Actual Number of Trips of Company B = ", testing_data["trip_id"].nunique())
Estimated Number of Trips of Company B =  5447
Actual Number of Trips of Company B =  3964
In [82]:
print("Estimated Average Trip Duration of Company B = ", round(training_data["tripDuration(minutes)"].mean()*overall_average))
print("Actual Average Trip Duration of Company B = ", round(testing_data["tripDuration(minutes)"].mean()))
Estimated Average Trip Duration of Company B =  9
Actual Average Trip Duration of Company B =  16
In [83]:
print("Estimated Average Distance Covered of Company B = ", round(training_data["distance_meters"].mean()*overall_average))
print("Actual Average Distance Covered of Company B = ", round(testing_data["distance_meters"].mean()))
Estimated Average Distance Covered of Company B =  784
Actual Average Distance Covered of Company B =  1825
In [77]:
round(training_data["ScooterID"].nunique()*overall_average)
Out[77]:
259
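The scooter estimate in In [77] is simply Company A's distinct-scooter count scaled by the blended activity ratio. As a reusable sketch (the fleet size below is a made-up placeholder, not the count from the data files):

```python
def estimate_fleet(scooters_a: int, activity_ratio: float) -> int:
    """Estimate a competitor's fleet by scaling a known fleet size
    by the blended B/A activity ratio."""
    return round(scooters_a * activity_ratio)

# Placeholder fleet size of 400, with the overall_average computed above.
estimate_fleet(400, 0.696615232216919)  # -> 279 with these placeholder inputs
```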

Validation & Conclusion

In [89]:
trip_ratio = testing_data["trip_id"].nunique() / training_data["TripID"].nunique()
print('The Ratio of Trips of Company B to A = ', trip_ratio)

duration_ratio = testing_data["tripDuration(minutes)"].mean() / training_data["tripDuration(minutes)"].mean()
print('The Ratio of Average Duration of Company B to A = ', duration_ratio)

distance_ratio = testing_data["distance_meters"].mean() / training_data["distance_meters"].mean()
print('The Ratio of Average Distance Covered of Company B to A = ', distance_ratio)
The Ratio of Trips of Company B to A =  0.5069702007929403
The Ratio of Average Duration of Company B to A =  1.2345980404122936
The Ratio of Average Distance Covered of Company B to A =  1.6216675365277475
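Before drawing conclusions, it helps to quantify how far each estimate landed from the actuals. A minimal sketch using the numbers printed earlier in this notebook (the dictionary keys are ad-hoc labels, not column names from the files):

```python
# Estimates vs. actuals, copied from the printed outputs above.
est = {"trips": 5447, "duration_min": 9, "distance_m": 784}
act = {"trips": 3964, "duration_min": 16, "distance_m": 1825}

# Signed percentage error of each estimate relative to the actual value.
errors = {k: 100 * (est[k] - act[k]) / act[k] for k in est}
for k, pct in errors.items():
    print(f"{k}: error = {pct:+.1f}%")
```

The ratio-based approach overshoots the trip count by roughly a third and undershoots duration and distance, which is consistent with the validation ratios printed above.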

Company A's scooters make almost twice as many trips as Company B's, while Company B's trips cover longer distances and last nearly twice as long as Company A's. We can therefore conclude that Company B operates fewer scooters in the same region than Company A, hence:

The estimated number of scooters Company B operates is approximately 260.

image.png

Company A

Figure 3: (generated using PowerBI) - Left (Starting Points of Journey) - Right (Ending Points of Journey)

image.png

Company B

Figure 4: (generated using PowerBI) - Left (Starting Points of Journey) - Right (Ending Points of Journey)

Summary to a business stakeholder describing your findings

  • Company A records more trips than Company B on an hourly, daily, and monthly basis. This is partly because Company B's trips are longer in both duration and distance, i.e. its scooters cover longer distances per trip.
  • Resources need to be scheduled carefully: the analysis shows that some regions are far denser in trip origins and destinations than others, so scooter availability should be guaranteed in those regions. In Figure A, the red, green, and blue regions have high density; keeping scooters available there will yield a higher return on investment.

image.png

Figure A
  • We should not deploy scooters randomly across this region/county/state; instead, we should define territories (see Image B for reference) and then schedule scooters within each territory.

image.png

Image B

Both of the above strategies will help reduce response and turnaround time (the time required to move from source to destination and back).
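Territories like those in Image B could be derived by clustering trip start points. A tiny pure-NumPy k-means sketch illustrates the idea (the coordinates are made-up points near the service area, not taken from the data files, and this is not how Image B was actually produced; scikit-learn's KMeans would do the same job on the real data):

```python
import numpy as np

# Made-up trip start points: three in a western pocket, three in an
# eastern pocket of the service area.
starts = np.array([
    [38.25, -85.76], [38.26, -85.77], [38.26, -85.75],
    [38.24, -85.72], [38.25, -85.73], [38.24, -85.71],
])

centers = starts[[0, 3]].astype(float)  # seed one center per territory
for _ in range(10):                     # alternate assignment / update steps
    # Distance from every start point to every territory center.
    d = np.linalg.norm(starts[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)           # nearest territory per start point
    # Move each center to the mean of the points assigned to it.
    centers = np.array([starts[labels == k].mean(axis=0) for k in range(2)])
```

`labels` assigns each trip start to a territory and `centers` gives candidate rebalancing depots; scheduling could then be planned per territory as suggested above.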

image.png

In [ ]: